Introduction and Outline

The search for improved performance
has focused on using different forms of parallelism
to achieve speed increases. 1 To this end,
Cray Research, Inc. (CRI) introduced vector
processing and, most recently, user-accessible
multi-tasking (Larson, 1984; Research,
1985; Research, 1984). The Cray work on
multi-tasking takes a “coarse grain” approach
to parallelism in contrast to the “fine grain”
parallelism of vector instruction sets or
dataflow (Dennis, 1979). Multi-tasking was
not introduced without tradeoffs.

The issues raised with the introduction
of multi-tasking and multiprocessing involve
more than performance. Multi-task programs
may require major changes in their algorithms,
storage management, and code. Toward this
end, new or modified programming languages
are needed.

Explicitly parallel languages must handle
problems beyond the scope of conventional
programming languages. These issues include
data protection, non-determinism, process
management (i.e., creation, scheduling, deletion),
interprocess communication, synchronization
(i.e., deadlock and starvation), and
error and exception handling (Denning, 1985).
These problems are well documented in
Carnegie-Mellon’s multiprocessor research
(Jones, 1980). There are few simple solutions, 2
and tradeoffs must be made. Grit and
McGraw compare parallel applications programming
to operating systems programming
in sheer difficulty (Grit, 1983).

System timing must receive careful con- 
sideration in multi-task codes to avoid incon- 
sistent results and deadlock. A sequential 
code hacking style is dangerous in this 


1 The terminology is varied, colorful, and highly
confusing. Among other phrases, we have: parallel processing,
multiprocessing, polyprocessing, distributed computing,
decentralised computing, and so forth. Each
phrase has a slightly different meaning: enough to make
communications difficult. CRI makes the subtle distinctions
that multiprogramming means multiple jobs working
on a CPU (e.g., time-sharing), multiprocessing means work
done on multiple physical CPUs working multiple jobs
(i.e., without regard for jobs), and multi-tasking means
multiple physical CPUs working cooperatively on a single
problem.

2 Jones and Gehringer specifically classify distributed
system issues into problems of 1) consistency, 2) deadlock,


environment. Care is required when dividing 
a problem into multiple tasks to avoid incon- 
sistency. This division is called partitioning or 
decomposition as well as by other terms. 

Several partitioning schemes can execute
codes in parallel (Jones, 1980). The most
common are pipelining, spatial partitioning (by
problem space or machine storage), and relaxation,
which removes assumptions of data consistency.
David Kuck is best known for his
research on automatic partitioning (Kuck,
1980). This paper covers the subject of partitioning
an existing application program by
hand.

The program “TWING” is the vehicle
that we use to explore the issues surrounding
multi-tasking. This report covers:

Existing Languages: Issues and Problems 

The Cray Multi-tasking Implementation 
The TWING Program 

Modifications to TWING 
The 2-Processor VAX Version 
The 2-Processor Cray Version 
Debugging and Other Consequences 
Performance Issues 
Discussion and Conclusion 

Our programming style is conservative and 
defensive. We assume the multi-task program
will not execute the first time. We chose a
synchronous algorithm and sought results 
identical to results using uni-task TWING. 
This work stresses the importance of careful 
analysis, design, and testing. 

Existing FORTRAN Drawbacks 

As background, it is useful to understand
the problems inherent with standard FORTRAN
and multi-tasking. FORTRAN is not
currently designed for or intended to run in a
parallel environment. New problems arise in
multi-tasking such as synchronization, communication,
error handling, and deadlock. An
excellent survey of language issues and various
attempts at solving them appears in Computing
Surveys (Andrews, 1988).

First, the standard FORTRAN language 
lacks process-creation primitives and struc- 
tures. The SUBROUTINE is the closest 
FORTRAN object resembling a process or a 
TASK. Second, the language lacks features 
for explicit synchronization and protection 

3) starvation, and 4) exception handling. 
such as semaphores (Dijkstra, 1968) (e.g.,
ALGOL-68), monitors (Hoare, 1974) (e.g., concurrent
Pascal), or rendezvous (e.g., Ada 3)
(DOD, 1980). It also lacks explicit communication
features such as mailboxes.

Each of the aforementioned synchronization
features has assumptions of atomicity
(uninterruptibility), which is critical for maintaining
a degree of consistency that standard
FORTRAN cannot currently provide. Synchronization
is a technique normally reserved
for operating system programming (using
libraries) since it offers “hazardous” user facilities. 4

Lastly, the software engineering problems
associated with FORTRAN are accentuated
in a multi-tasking environment. These
problems are documented elsewhere (Dijkstra,
1968); they include GO TOs and the lack of
modern data structures. An example of these
tradeoffs is the inability of Cray multi-tasking
FORTRAN to coherently perform multiple
RETURNs.

It is not easy to add these features to the 
FORTRAN language. These features conflict 
with existing language semantics. The pro- 
grammer must locate and manage side effects 
on globally referenced memory (such as COM- 
MON variables), call-by-reference parameter 
passing, and manufacturer-dependent features. 
These side effects also occur at the lower 
vector-processing level: Cray users have modi- 
fied their programming style to accommodate 
them. We can similarly expect users to adopt 
a multi-tasking programming style. 

Cray Multi-tasking FORTRAN Extensions

The existing Cray Research supercom- 
puter line performs efficiently by using a vec- 
tor instruction set. Performance improvement 
is achieved by using regular data-access pat- 
terns on arrays and their indices. Currently, 
multi-tasking seeks to achieve performance 
improvement using multiple processing units. 

Cray Research has a set of primitive
extensions to support multi-tasking in version
1.13 of their CFT FORTRAN compiler


3 Ada is a trademark of the Ada Joint Program Office
of the US DOD.

4 There exists the potential for user-induced system
deadlock.


(Larson, 1984). These extensions currently 
allow several virtual CPUs to execute simul- 
taneously on one to four physical CPUs. 
These primitives are invoked using subroutine 
CALLs. They are useful for creating more ela- 
borate synchronization mechanisms such as 
monitors (Hoare, 1974). 

The Cray primitives fall into three gen- 
eral categories: 

TASK creation and control 
EVENT creation and synchronization 
LOCK creation and protection 

The primitives are controlled using three basic
data structures: a TASK control array
(INTEGER type containing two or three elements),
EVENTs, and LOCKs (both of type
INTEGER), all explicitly assigned (i.e.,
created).

An extremely important semantic 5 difference
is the handling of storage (primary
memory) in this version of FORTRAN. Local
storage in normal FORTRAN has a static
allocation resulting in possible side effects.

The new multi-tasking CFT FORTRAN
requires a dynamic or stack-based allocation of
storage more characteristic of ALGOL-like
languages such as Pascal or C. This is necessary
for TASK creation and migration. Local
storage (scalars or arrays) now has a finite lifetime
and scope. A programmer cannot use a
value left over from a previous subroutine
CALL or assume values are initialized to zero
(0). This is a radical departure from standard
FORTRAN. The next four sections cover
these primitives and their effects in greater
detail.

TASK Control 

We begin with TASK creation. A user
controls a concurrent object called a TASK
that is invoked like a SUBROUTINE. The
TASK is defined like any other SUBROUTINE
except that its name must explicitly
appear in an EXTERNAL statement before a
CALL, and its storage is handled differently.
The specific TASK syntax primitives
are shown in figure 1 where SUBNAME is the
SUBROUTINE name, and ITCA is an
INTEGER TASK control array. Note,


5 We mention this because there are no FORTRAN
keywords (i.e., syntax) associated with this problem: it is
semantic.
CALL TSKSTART(ITCA, SUBNAME [, arguments])
CALL TSKWAIT(ITCA)

Figure 1. Cray TASK primitives.


restricted, positional SUBROUTINE arguments
are passable.

A TASK control array is a simple data 
structure that holds TASK control data for a 
scheduler that is loaded with the program on 
execution. This scheduler is distinct from the 
operating system’s scheduler in that it governs 
user defined TASKs rather than JOBs. 

The TASK is created using the
TSKSTART call. TSKSTART is similar to a
fork in languages like ALGOL-68 except that a
separate address space is created, much like a
separate space for a FORTRAN subroutine.
The effect is like a subroutine CALL with one
major exception: subroutine CALLs are synchronous
and consequently wait, unlike
TSKSTART calls.

The following program fragment (figure
2), listed in parallel, illustrates the creation of
a TASK. Note that the subprogram allocating
the TASK control array must not lose the
TASK control array storage! Severe problems
will result!

A “TSKWAIT” statement could force a 
crude explicit synchronization on execution of 
a RETURN statement within task A. The 
section on Debugging will touch on the use- 
fulness of TSKWAIT. More refined 


synchronization is available using EVENTs 
and LOCKs. There are also TSK calls 
covered in the Cray documentation that 
report TASK information or statistics 
(Research, 1985). 
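
As a rough modern analogy only (Python threads, not the Cray library), the start-then-wait pattern can be sketched as follows. The names task_a and run_task_and_wait are hypothetical, the thread object merely plays the role of the TASK control array, and a dictionary stands in for shared COMMON storage.

```python
import threading


def task_a(results, n):
    """Body of the task; plays the role of SUBROUTINE A."""
    results["sum"] = sum(range(n))


def run_task_and_wait(n):
    results = {}  # shared storage, standing in for COMMON
    # The Thread object plays the role of the TASK control array.
    ta = threading.Thread(target=task_a, args=(results, n))
    ta.start()    # like CALL TSKSTART(TA, A, ...): returns immediately
    # ... the caller could do independent work here ...
    ta.join()     # like CALL TSKWAIT(TA): block until the task returns
    return results["sum"]
```

Starting a task and waiting on it at once, as here, behaves like an ordinary synchronous subroutine call; the benefit appears only when the caller does other work between the two statements.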

Cray support of multi-tasking includes a
simple deadlock-detection mechanism.
Deadlock occurs when all user TASKs are
waiting for a condition that never occurs.
This goes for synchronization using
TSKWAIT, EVENTs, or LOCKs. Care is
required, particularly in using EVENTs,
because these functions are not necessarily
atomic (indivisible). [Deadlock is discussed
further in the section on Debugging.]

EVENTs and LOCKs 

Synchronization and consistency protection
use combinations of EVENTs and
LOCKs. Both are useful for simple synchronization.
The key difference between an
EVENT and a LOCK is that a LOCK forces
tasks to run in a First-In, First-Out (FIFO)
order. An EVENT is comparable to a “broadcast,”
and many TASKs can run at once. It
is also important to clear or reset a LOCK or
EVENT at appropriate times.


PROGRAM 
INTEGER TA(2) 

EXTERNAL A 

CALL TSKSTART(TA, A, arguments) SUBROUTINE A(parameters) 

END END 

Figure 2. An illustration of simple TASK creation. 
EVENTs and LOCKs are created by
using subroutine CALLs that assign special
protection in the same manner in which
TASKs are created. Basic arithmetic and logical
operations are disabled for these objects
until they are released. The specific primitive
SUBROUTINE CALLs are

EVENT Control        LOCK Control
EVASGN(IEVAR)        LOCKASGN(LCK)
EVPOST(IEVAR)        LOCKON(LCK)
EVWAIT(IEVAR)        LOCKOFF(LCK)
EVCLEAR(IEVAR)       LOCKREL(LCK)
EVREL(IEVAR)

in which IEVAR and LCK are INTEGERs
assigned as EVENTs or LOCKs. The following
is a simple two-TASK synchronization
using EVENTs in two separate executing
TASKs. The scope is shown by the bounding
boxes of figure 3. If an EVENT or a LOCK is
CLEARed or RELeased while some TASK is
waiting, the consequences are nondeterministic
and can be disastrous.
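
A two-TASK handshake of this kind can be approximated with Python's threading.Event, whose set, wait, and clear methods loosely correspond to posting, waiting on, and clearing an EVENT. This is a sketch by analogy, not the Cray interface; the function name two_task_handshake and the dictionary keys are hypothetical.

```python
import threading


def two_task_handshake(value):
    shared = {}                  # stands in for COMMON storage
    ready = threading.Event()    # created unposted
    done = threading.Event()

    def producer():
        shared["v"] = value      # write the shared variable first
        ready.set()              # post: announce that v is valid
        done.wait()              # wait for the consumer's reply
        shared["ack_seen"] = True

    def consumer():
        ready.wait()             # block until v has been written
        shared["doubled"] = 2 * shared["v"]
        done.set()               # post the reply

    tasks = [threading.Thread(target=consumer),
             threading.Thread(target=producer)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return shared
```

Because each task waits on an event that the other posts, the read of v can never precede its write, whichever thread the scheduler happens to run first.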

If combinations of EVENTs, LOCKs,
and COMMON memory are used, it is possible
to make more elaborate synchronization
mechanisms such as semaphores and monitors.
Sequential critical sections of code and data
need protection using these synchronization
primitives. Problems of inconsistent synchronization
are covered in the next section.
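
To illustrate how a semaphore might be assembled from one lock, one event, and a shared counter, here is a minimal Python sketch. It is an analogy to the EVENT, LOCK, and COMMON combination described above rather than Cray code; CountingSemaphore and run_bounded are hypothetical names.

```python
import threading
import time


class CountingSemaphore:
    """Counting semaphore built from a lock, an event, and a counter."""

    def __init__(self, count):
        self._count = count                # shared state, like a COMMON variable
        self._lock = threading.Lock()      # guards _count (role of a LOCK)
        self._nonzero = threading.Event()  # posted while _count > 0 (role of an EVENT)
        if count > 0:
            self._nonzero.set()

    def acquire(self):
        while True:
            self._nonzero.wait()           # sleep until a unit may be available
            with self._lock:               # re-check the counter under the lock
                if self._count > 0:
                    self._count -= 1
                    if self._count == 0:
                        self._nonzero.clear()
                    return

    def release(self):
        with self._lock:
            self._count += 1
            self._nonzero.set()


def run_bounded(n_workers, limit):
    """Run n_workers tasks with at most `limit` inside the guarded region."""
    sem = CountingSemaphore(limit)
    stats = {"active": 0, "max_active": 0, "done": 0}
    stats_lock = threading.Lock()

    def worker():
        sem.acquire()
        with stats_lock:
            stats["active"] += 1
            stats["max_active"] = max(stats["max_active"], stats["active"])
        time.sleep(0.01)                   # simulated work in the guarded region
        with stats_lock:
            stats["active"] -= 1
            stats["done"] += 1
        sem.release()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["max_active"], stats["done"]
```

The event is set and cleared only while the lock is held, so its state always agrees with the counter; a task that wakes up and finds the counter already taken simply waits again.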

Communications 

Communication takes place through one
of three mechanisms:

CALL-by-Reference parameter passing
Global COMMON memory
TASK COMMON memory

Data is passed using shared (e.g., COMMON) 
variables. This is the principal means of com- 
munication and requires care in use. 

A TASK-local COMMON (i.e., TASK
COMMON) is available in version 1.14 of the
CFT compiler. It is similar to the more global
COMMON except that its data is accessible
only to objects (SUBROUTINEs) within a
particular TASK. Maintaining a consistent
system state is a chore left to the user.

Consistency is threatened by three basic
hazards. Suppose A and B are two TASKs
running in parallel and sharing a variable V.
The hazards are based on the order in which
processes access V: a timing problem. The
first hazard is the read-write hazard: having
one TASK prematurely reading a stale value
before the appropriate write. The next is the
write-read hazard: having one TASK prematurely
“clobbering” a value before it could be
read. The last hazard is the write-write
hazard, in which one TASK writes over values
that never get a chance to be read (particularly
difficult to detect). 6 The Cray system does not
guard against these potential user errors of
timing.
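
One common defence against the write-write hazard is to make each read-modify-write of V a critical section. The Python sketch below shows the idea with a lock held around the update, in the way a LOCKON and LOCKOFF pair would bracket it; the function name accumulate and the counting workload are hypothetical and chosen only so the result is easy to check.

```python
import threading


def accumulate(n_tasks, n_updates):
    """Several tasks increment one shared variable under a lock."""
    shared = {"v": 0}            # the shared variable V
    lock = threading.Lock()

    def task():
        for _ in range(n_updates):
            with lock:           # enter critical section
                old = shared["v"]       # read V
                shared["v"] = old + 1   # write V; no other task can intervene
            # lock is released here

    threads = [threading.Thread(target=task) for _ in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["v"]
```

With the lock in place every update is seen by the next one, so the final value equals the number of updates issued. Removing the lock would allow two tasks to read the same old value and lose one of the writes.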

Storage and Subroutine Linkage 

The actual handling of storage differs 
vastly from conventional static FORTRAN. 
This has its greatest effect on SUBROUTINE 

6 The memory on the Denelcor Heterogeneous Element
Processor (HEP) is an attempt to solve this problem. If variables
receive a special declaration, they are forced to alternate
reads and writes using a unique semaphore memory
system.




Figure 3. Synchronization of two TASKs using EVENT flags. 
Boxes represent different address spaces. 







and FUNCTION linkages. The semantics of
these new linkages prompt some users to name
this an entirely different language (e.g., “not-FORTRAN”).
Old memory-saving tricks such
as statically defined and allocated variables
left for a second subroutine CALL are now
undefined and may contain unreliable data.
Users cannot assume values are initialized to
zero (0). Expressions in parameter lists
involve similar problems.

Those readers familiar with dynamic 
storage management in scoped languages such 
as ALGOL, C, Pascal, or LISP should grasp 
these concepts easily. FORTRAN simply does 
not offer the protection mechanisms to ensure 
consistency of data in a multiprocess environ- 
ment. The user must actively manage the 
data consistency and program defensively. 

The Mathematical Basis for TWING 

TWING is a program that solves the
conservative full-potential equation, using a
fully implicit, approximate-factorization algorithm.
The program solves for steady-state
airflow over a wing flying at transonic velocity.
TWING is the development of Dr. Terry
Holst and Scott Thomas (Thomas, 1983) at
the Applied Computational Aerodynamics
Branch, NASA Ames Research Center.

Figure 4 is a schematic of the finite
difference mesh over which the flow solver
operates. From this representation in “physical
space”, the problem is transformed into a
“computational space” (figure 5), which
preserves the orthogonality of the mesh lines
throughout the computational domain.


A mathematical representation of this
flow solver is given in the derivation of equation
1.c. The three-dimensional, full potential
equation (in x, y, z coordinates) is presented in
equation 1.a. The transformation into computational
coordinates (ξ, η, ζ coordinates) yields
equation 1.b. In this equation, U, V, and W
are terms composed of φ_ξ, φ_η, and φ_ζ, combined
with assorted metric quantities. J
represents the Jacobian of the transformation.
The finite-difference approximation of this
transformed equation employs backward
difference operators in the ξ, η, and ζ directions.
This yields the finite-difference approximation
in equation 1.c. The special density
coefficients $\tilde{\rho}$, $\bar{\rho}$, and $\hat{\rho}$ introduce an artificial
viscosity term into the calculation. The residual
term L(φ) obtained from this equation is
used in the first step of the factorization
scheme outlined below.

An outline of the three-step
approximate-factorization scheme is shown in
the derivation of equation 2.c. In step one
(equation 2.a), an intermediate term G(i,j) is
computed for each point on a given “k-shell”
of the mesh by solving a tridiagonal linear system
along each η line (i.e., ξ = a constant)
extending from the symmetry plane out to the
freestream sidewall. In step two (equation
2.b), G(i,j) is used to compute another intermediate
term F(i,j,k) for each point in the “k-shell.”
This step requires the solution of a tridiagonal
linear system along each ξ line (i.e., η = a
constant) extending from the upper vortex sheet
around the leading edge to the lower vortex
sheet (figure 6). Finally, when F(i,j,k) has
been computed for every point in the three-dimensional
mesh, the correction factor

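$$(\rho \phi_x)_x + (\rho \phi_y)_y + (\rho \phi_z)_z = 0 \qquad (1.a)$$

The three-dimensional full potential equation (x, y, z coordinates).

$$\left(\frac{\rho U}{J}\right)_\xi + \left(\frac{\rho V}{J}\right)_\eta + \left(\frac{\rho W}{J}\right)_\zeta = 0 \qquad (1.b)$$

The full potential equation in computational space (ξ, η, ζ).

$$\overleftarrow{\delta}_\xi \left(\frac{\tilde{\rho} U}{J}\right)_{i+1/2,j,k} + \overleftarrow{\delta}_\eta \left(\frac{\bar{\rho} V}{J}\right)_{i,j+1/2,k} + \overleftarrow{\delta}_\zeta \left(\frac{\hat{\rho} W}{J}\right)_{i,j,k+1/2} = 0 \qquad (1.c)$$

The resultant finite-difference approximation.

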
Figure 4. Sample finite difference mesh.

Figure 6. Computation divided into two tasks.

Figure 7. Computation done as a region of pencils.

Step 3: Correction factor C.

Steps in the finite differencing scheme.


Program VTWING
  Input subroutine (INPUT)
    READ mesh
    READ run-time parameters
  Initialization subroutine (INIT)
    initialize the solution
    compute and store metrics
  Flow Solver: (SOLVE)
    for each iteration do
      for each k-shell in mesh do
        get metrics
        compute density and density coefficients
        compute residuals
        solve for G(i,j) and F(i,j,k)
      end k-loop
      calculate and apply C(i,j,k)
      output maximum residual and correction for iteration
      check convergence
    end iteration loop
    output solution

Figure 8. Sequential structure of the TWING Program.


C(i,j,k) is computed in step three (equation
2.c). This calculation proceeds from the outer
boundary down to the wing surface, requiring
the solution of a bidiagonal system for each ζ
line (i.e., ξ and η = constants, figure 7) of the
mesh. This correction factor is then added to
the solution from the previous iteration, generating
a new solution. This three-step process
is repeated iteratively until convergence is
achieved or a preset maximum iteration is
reached.
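
Each of the first two steps reduces to solving many independent tridiagonal systems, one per mesh line. The paper does not say which solver TWING uses; as an assumption, the sketch below uses the standard Thomas algorithm (forward elimination followed by back substitution) in Python. The name solve_tridiagonal and the argument layout are hypothetical, and the routine assumes a system for which no pivoting is needed (for example, a diagonally dominant one).

```python
def solve_tridiagonal(a, b, c, d):
    """Solve a tridiagonal system by the Thomas algorithm.

    a: sub-diagonal   (a[0] is unused)
    b: main diagonal
    c: super-diagonal (c[-1] is unused)
    d: right-hand side
    All four lists have the same length n. No pivoting is performed.
    """
    n = len(d)
    cp = [0.0] * n   # modified super-diagonal
    dp = [0.0] * n   # modified right-hand side

    # Forward elimination
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Back substitution
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Since the systems along different mesh lines share no unknowns, calls like this one can be handed to separate tasks without any synchronization inside the solve. The bidiagonal systems of step three need only the substitution sweep.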


An outline showing the code structure 
itself is presented in figure 8. The program 
first reads the physical coordinates of the fin- 
ite difference mesh and its run-time parame- 
ters. The program then computes the metric 
quantities defining the transformation of the 
problem into “computational space” and 
writes these to disk. 

At this point, the main iteration loop of 
the program begins. The program completes 
steps one and two (equations 2.a and 2.b) of



Figure 9. Computation divided in two different regions. 



the three-step approximate-factorization
scheme outlined above operating on successive
“k-shells” in the mesh, beginning at the surface
of the wing and progressing to the outer
boundary. For each k-shell, the code

(1) fetches the appropriate subset of metrics 
from the disk 

(2) computes the density at each point 

(3) generates the special density coefficients 

(4) computes the residual terms resulting
from equation 1.c

(5) solves for G(i,j) and F(i,j,k) 

After completing this “k-loop,” the code com- 
pletes step three of the scheme by calculating 
the correction C(i,j,k) and applies it to each 
mesh point to generate a new solution. A con- 
vergence check follows: when satisfactory con- 
vergence is achieved, the final solution is writ- 
ten to disk. 

The Modification of TWING 

TWING is written in portable FORTRAN
66 and executes on Cray, CDC 7600,
and VAX CPUs. The program was rewritten
to be well-structured. Its control flow is serial
(i.e., few GO TOs jumping control around).
Although it was possible to partition the computation
along functional lines in a sort of
high-level pipeline, this approach was not pursued
because it needs either substantial additional
memory or elaborate internal buffering
to store intermediate results. Pipelining may
also hinder efficient execution-time load-balancing
with some stages of a pipeline executing
longer than other stages of the pipe.

This problem was exacerbated in
TWING by the extensive use of
EQUIVALENCE statements in the original
code, employed in an effort to squeeze the
largest possible problems into the limited core
memory of a CDC 7600 or a Cray 1S. Since a
functional partitioning of the problem seemed
unsuited to the limited shared memory available,
a static spatial-partitioning scheme was
employed.

Our restructuring took advantage of
existing code and attempted as little algorithm
change as possible. In this scheme, each step
in the algorithm was examined in an effort to
determine if several portions of the mesh could
be operated on simultaneously at that step.
Execution profiling using the Cray FLOWTRACE
facilities showed dominant run times
in three SUBROUTINEs. Vectorized TWING
executed three times faster than scalar
TWING with input-output overhead included.
Since distinct steps in the algorithm tend to
correspond to separate modules in the finished
code, this process resulted in a body of code
that formed the skeleton of the concurrent
processing portion of the modified TWING.

The calculations of the density (subroutine
RO), the special density coefficients (subroutine
ROCO), and the residuals (subroutine
RESID) were all split along the η axis for
each “k-shell” in the computational mesh
(figures 6 and 7). This resulted in splitting
loops (figure 10). One processor generated
these results for points on or between the symmetry
plane boundary and the wingtip. The
other processor handled points on the wing
extension, out to the freestream sidewall boundary.
This “inboard-outboard” partitioning
scheme was chosen because the algorithm
employed in each of these calculations is usually
constant for a given ξ line (η = a constant)
but varied with position along the η
axis. An inboard-outboard scheme was therefore
constructed using processor-dependent
branches such as:

C
      IF (TASKID .EQ. 2) GOTO 12
      DO 10 I = 1,NIM


   10 CONTINUE
C     This continue added for multi-tasking
   12 CONTINUE


Mathematically, however, each point in the
mesh was operated on independently during
these preliminary calculations. We could
divide the mesh differently if there
were reasons for favoring another partitioning.


A more fundamental relationship
between the underlying mathematics of the
algorithm and the spatial decomposition of the
problem for Multiple-Instruction stream,
Multiple-Data stream (MIMD) execution is
illustrated by the three-step approximate-factorization
scheme outlined in a previous
section. Recall that in equation 2.a, the backward
differencing is performed only about η,
which generates tridiagonal linear systems
along η lines (ξ = a constant). This makes the
inboard-outboard partitioning scheme used
above unworkable for this step.
Table 1. Execution Time Profiling

Subroutine    Vectorized TWING†      Scalar TWING
              % Total Run Time       % Total Run Time
RO            15.93                  14.92
ROCO          13.45                  15.91
RESID         23.92                  17.99
Total %       53.73                  48.82

† To clarify: this is not % of vector execution.


C     Variables declared as INTEGER; TASKID obtained from TSKVALUE.
      IF (TASKID .EQ. 1) THEN
         ROJSTART = 2
         ROJSTOP = NJTM
      ELSEIF (TASKID .EQ. 2) THEN
         ROJSTART = NJT
         ROJSTOP = NJM
      ENDIF
C     The values of NJTM, NJT, and NJM are preset parameters
C     in uni-tasked TWING.
C     Now, each process works on the j-lines defined by
C     the initial assignment block.
C
      DO 20 J=ROJSTART,ROJSTOP
C     DO 20 J=2,NJM      (old statement)
      DO 15 I=1,NIM

   15 CONTINUE
   20 CONTINUE


Figure 10. Code illustrating the splitting of a loop. 
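
The same loop-splitting idea can be sketched in Python, purely as an illustration: each task receives its own inclusive range of j-lines and writes only to rows inside that range, so the two tasks never touch the same storage. The names process_rows and split_run, the split index j_tip, and the arithmetic in the loop body are all hypothetical stand-ins.

```python
import threading


def process_rows(field, out, j_start, j_stop):
    """Work on j-lines j_start..j_stop inclusive, like a DO loop."""
    for j in range(j_start, j_stop + 1):
        for i in range(len(field[j])):
            out[j][i] = 2.0 * field[j][i] + 1.0   # placeholder computation


def split_run(field, j_tip):
    """Split the j range into inboard (below j_tip) and outboard parts."""
    nj = len(field)
    out = [[0.0] * len(row) for row in field]
    bounds = [(0, j_tip - 1),      # task 1: inboard rows
              (j_tip, nj - 1)]     # task 2: outboard rows
    tasks = [threading.Thread(target=process_rows, args=(field, out, lo, hi))
             for lo, hi in bounds]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()                   # both halves must finish before out is used
    return out
```

Running the same routine once over the full range gives a serial reference result, which is a convenient check that the split version produces identical output.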


However, adjacent η lines are computationally
independent at this step, implying
that the mesh could partition into “top” and
“bottom” sections, each handled by a separate
processor (figure 9). Similarly, in step two
(equation 2.b), the backward differences are
taken about ξ, generating tridiagonal linear
systems along ξ lines (η = a constant) through
the mesh. Here, each ξ line is computationally
independent, and the resulting tridiagonal
systems are solved concurrently by dividing
the mesh into the inboard and outboard sections
described in the last paragraph (see figure
6).


Finally, in step three (equation 2.c),
bidiagonal systems are generated along lines in
the ζ direction (ξ and η both = constants)
(see figure 9). Again, concurrent processing of
multiple ζ “pencils” is a simple and powerful
way to use an MIMD machine at this step.

Note that true MIMD capacity was 
required to use such a spatial partitioning 
scheme. A vector architecture alone would 
not suffice because there was no guarantee 
that the instruction stream to be executed 
would be the same at different points in the 
mesh. Split difference schemes have sometimes
proved useful. The wing root could
have used a more complex differencing scheme
than employed near the outer boundary of a
mesh. It is also possible that the values of
some program parameters might be position
dependent.

Another code sequence commonly
encountered in TWING was the selection of
the maximum or minimum value in an array
following an operation on the elements of the
array. While this search could be conducted
in a serial mode by the main program after
the subprocesses return, this considerably
degraded the resulting speedup. A better
approach was to have each subprocess locate
the maximum or minimum element in its portion
of the data base, and pass the indices of
this value back to the main program. The
main program needed only to compare the two
passed elements to obtain a maximum or
minimum over the entire data base. An
example of such a coding sequence is shown
in figure 11, where
numbered variables are TASK determined and
nonnumbered variables are global shared variables.
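
A minimal Python sketch of this two-stage search follows, under the assumption that the residuals are held in a two-dimensional list split by rows between two tasks. The names local_max and global_max are hypothetical; each task records the largest-magnitude value in its own rows together with its indices, and the main program then compares only the two candidates.

```python
import threading


def local_max(resid, j_start, j_stop, slot, results):
    """Find the largest-magnitude entry in rows j_start..j_stop inclusive."""
    best = (0.0, -1, -1)                     # (value, i, j); -1 means none yet
    for j in range(j_start, j_stop + 1):
        for i, r in enumerate(resid[j]):
            if best[1] < 0 or abs(r) > abs(best[0]):
                best = (r, i, j)
    results[slot] = best                     # each task writes only its own slot


def global_max(resid, j_split):
    """Combine two task-local maxima into one global maximum."""
    results = [None, None]
    tasks = [
        threading.Thread(target=local_max,
                         args=(resid, 0, j_split - 1, 0, results)),
        threading.Thread(target=local_max,
                         args=(resid, j_split, len(resid) - 1, 1, results)),
    ]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    r1, r2 = results
    # Serial part: a single comparison instead of a full array search.
    return r1 if abs(r1[0]) >= abs(r2[0]) else r2
```

The serial work in the main program shrinks from a scan of the whole array to one comparison, which is the source of the improved speedup described above.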

VAX Modification 

Our first MIMD testbed used two VAX
11/780 minicomputers linked to one MA780
multi-ported, shared memory unit. Because
the operation of the processors was
asynchronous, each with its own copy of the
operating system running on a local clock, the
configuration was best described as a “loosely
coupled” multiprocessor. Although each processor
retained its large virtual address space
as local memory, the shared memory in the
MA780 was not virtually addressable. Each
MA780 unit could accommodate up to two
megabytes of physical memory. The unit
employed for this study was equipped with
256 kilobytes of physical memory.

The operating system in use at the time 
of the study was VAX/VMS (Version 3.1). 
VAX/VMS provides three facilities for inter- 
process communication across the shared 
memory link: event flags, mailboxes, and glo- 
bal data sections. 

Event flags are allocated in thirty-two
bit clusters and are manipulated using a
variety of system-supplied routines. A process
could set or clear individual flags and could
wait for the logical AND or OR of a multiple
flag mask. One drawback of VMS event flag
services for MIMD programming was that the
flag operations were not indivisible (atomic).
This can cause difficulties when an MIMD
program uses shared memory that requires
protection from simultaneous access by more than
one process, especially if the number of competing
processes is great. In the present study
this problem did not arise, both because, at

      IF (ABS(RMAX1) .GE. ABS(RMAX2)) THEN
         IF (ABS(RMAX1) .GT. ABS(RMAX)) THEN
            RMAX = RMAX1
            IRMAX = IRMAX1
            JRMAX = JRMAX1
            KRMAX = KRMAX1
         END IF
      ELSE IF (ABS(RMAX1) .LT. ABS(RMAX2)) THEN
         IF (ABS(RMAX2) .GT. ABS(RMAX)) THEN
            RMAX = RMAX2
            IRMAX = IRMAX2
            JRMAX = JRMAX2
            KRMAX = KRMAX2
         END IF
      END IF

Figure 11. Selecting a maximum value from two locally determined maxima. 
most, two processes were active simultaneously
and also because they generally operated
on different parts of the statically partitioned
data base.

The VAX/VMS system was not
intended to be a multiprocessor operating system.
Programming the shared memory was
clumsy. Since our shared memory was small,
we reduced the resolution of the program to fit
the space of the memory. This was a development
measure that was not needed on our
Cray. This paper does not cover the VAX-specific
version in any greater detail.

The other MIMD testbed consists of a 
Cray X-MP/22 running version 1.13 of the 
Cray Operating System (COS). The Cray, by 
way of contrast, is a “tightly coupled,” shared 
memory multiprocessor. This creates prob- 
lems not faced on our VAX testbed such as 
more memory contention but simplifies pro- 
gramming. 

Cray Modifications 

The VAX version of TWING was a
“stripped-down” version of the production
Cray code designed to fit into the small shared
memory system. We, therefore, did not count
on the VAX version to reach convergence.
The mesh was too coarse, and we did not get
a chance to truly debug the VAX version.
The mathematical basis for partitioning the
vector version of TWING (VTWING) was
identical to the VAX-specific version. This
time, we sought realistic convergence. Debugging
was a major problem not only for
TWING, but also for the new STACK allocation
and multi-tasking of the CFT compiler we
were testing.

One important side step was a quick set
of checks regarding the new SUBROUTINE
linkages. We should mention this was not a
problem for TWING. To do this, a user compiled
the complete, existing program using the
ALLOC=STACK option on the new CFT
compiler. The program was then run using
the associated new loader given adequate
stack and heap sizes (see the manual)
(Research, 1985). The results were compared
to the original STATICally compiled run. A
useful variation of this was to create simple
TASKs that START then immediately WAIT
as a CALL to a SUBROUTINE would:


change from:          to:
CALL RO               CALL TSKSTART(TA, RO)
                      CALL TSKWAIT(TA)

The timing differences between STATIC and
STACK runs are included in the section on
Performance. The compiler changes do
affect program execution without source code
changes.

The next stage entailed converting the
existing code into a multi-tasking body of
code. This was not as easy as it appeared, as
subtle errors required detection and correction.
It is possible to do this at different levels or
stages such as converting the entire program,
converting subroutines, or converting blocks of
code. Converting a code in large sections is
like writing a large program and expecting it
to run correctly the first time.

It was important to have good comparison
data, since fast execution did not imply
correct execution. A machine-readable output
was created from an unmodified, running
version of TWING. Once the multi-task code
was running, we tested the output of the
multi-task run against our uni-task output
using a differential file comparator (the
UNIX⁷ diff program). This ensured that our
conversion was precise.
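The comparison step can be sketched with
Python's standard `difflib` module in place of
the diff program. The function name and the
sample lines in the usage note are
illustrative assumptions, not TWING output.

```python
import difflib


def compare_outputs(uni_lines, multi_lines):
    """Return unified-diff lines; an empty list means the runs agree exactly."""
    return list(
        difflib.unified_diff(
            uni_lines,
            multi_lines,
            fromfile="uni-task",
            tofile="multi-task",
            lineterm="",
        )
    )
```

For example, `compare_outputs(["1.0 2.0"],
["1.0 2.5"])` reports the changed line with
`-` and `+` prefixes, while two identical runs
give an empty list.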

Our third and last attempt at conversion
was to break a subroutine into two smaller
subroutines: a parallel portion and a serial
portion. Since most of the data was stored in
COMMON blocks, parameter passing was
minimized, which simplified the split. The
parallel subroutines were run and synchronized
before the serial portion as shown in Figure
12.

Portions of serial subroutine code
(typically loops) then migrated to the parallel
subroutines. This technique successfully
identified subscripting oversights, branching
problems, and so on. It was painfully slow,
but it was effective. Initially, task
synchronization was performed using TSKSTART
and TSKWAIT, not the more complex EVENT
flags. We used the "Make it right before you
make it faster" philosophy from The Elements
of Programming Style (Kernighan, 1978).

We stress the following point: make cer- 
tain that the existing code is bug-free. There 


⁷ UNIX is a trademark of AT&T Bell Laboratories.
take:       CALL S          SUB S

becomes:    CALL P1         SUB S1a    SUB S1b
            CALL S2         SUB S2

Figure 12. Code migration from serial into
parallel, where S is the subprogram, the
numbered portions refer to the halves (1 and 2)
of S. P1 represents the set of CALLs that are
invoked for parallel TASKs S1a and S1b.


is little sense in trying to multi-task
bug-ridden code. Multi-tasking the code made
programs harder to debug. The programmer has
to distinguish the original bugs from the
newly introduced linkage and multi-task bugs.

Each SUBROUTINE was individually
converted to two parallel TASKs, giving three
versions of the program. The next step was to
get combinations of two different TASKs
running within a program. This was used to
locate side effects between any two different
TASKs. We still used the crude START and
WAIT CALLs at this point. Finally, we had
all three CALLs converted.

Once all TASKs were operating using
crude synchronization, it was a simple matter
to get barrier synchronization using EVENTs.
We moved one TASK at a time to EVENT
structures. After EVENTs replaced the
TSKSTART and TSKWAIT CALLs, we
wrote a simple user-level TASK scheduler
(Figure 13) that worked on simple message
passing.
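The two-event handshake behind such a
scheduler can be sketched as follows. This is
a Python analogue under stated assumptions,
not the Cray EVENT library: `threading.Event`
stands in for the GO and DONE EVENTs, the
`Scheduler` class, the `handlers` table and
the `STOP` message code are hypothetical
names, and only one worker TASK is modeled.

```python
import threading

STOP = 0  # hypothetical message code that ends the worker loop


class Scheduler:
    """Two-event handshake between a scheduler and one worker task."""

    def __init__(self, handlers):
        self.handlers = handlers          # message code -> callable
        self.mesg = STOP                  # shared message word (like MESG)
        self.go = threading.Event()       # plays the role of EVENT GO
        self.done = threading.Event()     # plays the role of EVENT DONE
        self.worker = threading.Thread(target=self._proces)
        self.worker.start()

    def sched(self, mesg):
        """Post one message and block until the worker has handled it."""
        self.mesg = mesg
        self.done.clear()
        self.go.set()                     # like CALL EVPOST(GO)
        self.done.wait()                  # like CALL EVWAIT(DONE)

    def stop(self):
        """Ask the worker to exit and wait for it to finish."""
        self.sched(STOP)
        self.worker.join()

    def _proces(self):
        while True:
            self.go.wait()                # like CALL EVWAIT(GO)
            self.go.clear()
            mesg = self.mesg
            if mesg != STOP:
                self.handlers[mesg]()     # dispatch on the message code
            self.done.set()               # like CALL EVPOST(DONE)
            if mesg == STOP:
                return
```

A caller would build `Scheduler({1: work})`,
call `sched(1)` once per unit of work, and
finish with `stop()`. The worker clears GO
before it posts DONE, so a later post of GO
cannot be lost.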

Our last act scaled the grid from VAX
shared-memory size to Cray production-memory
size. During this final work, we corrected
one VAX-scale dependency problem. This
problem involved a partial correctness proof
mentioned further in the section on
Debugging.

Time and Effort 

This work took several months. We
reported our many compiler problems to CRI.
Meanwhile, Cray Research migrated from
CFT Release 1.13 to 1.14, solving many of our
problems.


To reiterate the degree of change, 
TWING actually consisted of two separate 
programs*: a grid generator and the vectorized 
version of the TWING flow solver. The 
multi-tasking took place only on the flow 
solver. 

We document the GRIDGEN program
here only for completeness. The GRIDGEN
program consisted of

2123 total lines of commented FORTRAN 
1195 executable lines of code in 
1031 executable statements 

An instrumented uni-task version of the 
TWING solver consisted of 

3926 lines of commented code 
3840 lines without instrumentation 
2529 total executable lines 
1906 executable statements 

An instrumented multi-task version of 
TWING came to 

4450 lines of commented code 
4399 lines without instrumentation 
2870 lines total executable 
2188 executable statements 

Note that additions and modifications do not 
sum to the totals because there is overlap. 
Additions and modifications took the form of 
replication and addition of statements to han- 
dle problems such as parameter passing. 

Our experience with converting this and
other NASA codes [LES and ARC3D]
currently has us modifying about 10% of the
code (if the grid generator is counted, slightly
more if not). Most of these codes have fewer


*The two programs are combined as one for 
machines with large memory. 
      MESG = 1
      CALL SCHED

      SUBROUTINE SCHED
C     Release the worker TASK, then wait for it
      CALL EVPOST(GO)
      CALL EVWAIT(DONE)
      END

      SUBROUTINE PROCES
C     Wait for a message, then dispatch on MESG
      CALL EVWAIT(GO)
      IF (MESG.EQ.1) THEN
         CALL RO
      ENDIF
      CALL EVPOST(DONE)
      END


Figure 13. Structure of our simple scheduler. 


loops split across processors compared to 
TWING. We split a total of 19 loops in three 
SUBROUTINES. This includes new code for 
loop splitting, new per-process branches, 
TASK-EVENT creation and control code, and 
a small TASK scheduler. About 210 lines of
control flow code were added (excluding
comments). Another 70 lines were replaced or
modified into 160 lines to handle problems of
parameter passing or changes to array indices.

During the development of each TASK, 
good version control proved useful. A good 
tool requires parallel branching versions; linear 
version control such as UPDATE was not ade- 
quate. Maintaining the successful, intermedi- 
ate stages of multi-task TWING made debug- 
ging and scale-up easier through the isolation 
of changes. It was always possible to easily 
fall back to some parallel, executable code. 

Debugging 

Sequential debugging is generally
regarded as a black art. Bugs occur during
compile-time and run-time; among the latter,
the non-fatal ones are the hardest to find.
The basic techniques for debugging are
categorized into 1) traces, 2) snapshots, or 3)
dumps. These techniques have problems in
multiprocess environments that lack
consistency or have deadlock. Multi-task
debugging is plagued by a lack of
reproducibility, synchronization, and good
tools. The literature on run-time debugging
in multiprocess environments is scarce (Model,
1979), and more work is needed in this area.

Numerous users tell us to "force
multi-task execution into a single stream of
execution⁹" as if simple user-controlled
reduction would solve hazard problems.


This does not help! 


Normal debugging depends on a machine
being in a reasonably consistent state. A
multi-task program crash may not occur at
the same location as with a uni-task program.
This is true for uniprocessors executing
multi-task code as well.

Consider a simple example to illustrate
the conceptual difficulties of debugging using
the CFT traceback facility. A program
creates a child TASK. When the child TASK
dies, should the traceback trace through the
point where the child process began, or should
it trace through the synchronization routines
(if any)? The tangled nondeterministic web
makes this decision difficult. There are
situations where one trace is preferable over
the other. One condition is when the child
dies because of the actions of its parent or
sibling processes [side effects]. So, traces
are not simple. What about snapshots?


⁹ This is accomplished using the TSKTUNE call and
setting the MAXCPU parameter to '1'.
Inserting WRITE statements into
programs might not help. First, the execution
order of these statements may vary (because
of nondeterminism). Second, I/O is another
shared resource, and the user must have LOCKs
that protect that resource like any other
shared resource.
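Both points can be seen in a small sketch.
This Python analogue is an assumption-laden
illustration: `threading.Lock` stands in for a
Cray LOCK, and `locked_write` and `run_tasks`
are hypothetical names. The lock keeps each
line intact and preserves each task's own
order, but the interleaving between tasks
still varies from run to run.

```python
import threading

write_lock = threading.Lock()  # one lock guards the shared output stream


def locked_write(stream, text):
    """Write one whole line while holding the I/O lock."""
    with write_lock:
        stream.write(text + "\n")


def run_tasks(stream, n_tasks=2, n_lines=5):
    """Run n_tasks threads that each write n_lines trace lines to stream."""
    def task(tid):
        for step in range(n_lines):
            locked_write(stream, f"task {tid} step {step}")

    threads = [
        threading.Thread(target=task, args=(tid,))
        for tid in range(1, n_tasks + 1)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```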

One surprising effect of inserting
WRITE statements at key points was the
migration of bugs from one location to
another! We solved this debugging problem
by modifying our technique of migrating code
between serial and parallel development
subroutines. Our new technique was to remove
data structures and code immediately
following the breakage point to isolate
program and compiler bugs. This sometimes
worked to locate bugs. The problem at this
point becomes: is the program crashing because
of the original bug or the bugs introduced by
cutting code?

In the I/O locking process, it would help
users debug codes if the system could hide I/O
locking details from users. Better yet, a small
library of simple routines would help. It
should have traceable ERROR and ASSERTION
routines. If a user resorts to adding
WRITE statements to follow the execution of
a program, the user should have a similar
trace of a serial code for comparison.
A simple filter could take a source program
and insert a WRITE with the subprogram's
name. More elaborate and more powerful
debugging tools would also help.
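Such a filter might look like the sketch
below. It is a heuristic written in Python
under several assumptions that are not from
the original work: the input is fixed-form
FORTRAN, the trace goes to unit 6 with a
list-directed WRITE, and the list of
specification keywords is incomplete. The
WRITE is placed before the first executable
statement so that it does not precede the
declarations.

```python
import re

SUB_RE = re.compile(r"^\s+SUBROUTINE\s+(\w+)", re.IGNORECASE)
SPEC_RE = re.compile(
    r"^\s+(IMPLICIT|INTEGER|REAL|DOUBLE|COMPLEX|LOGICAL|CHARACTER|"
    r"DIMENSION|COMMON|EQUIVALENCE|PARAMETER|EXTERNAL|INTRINSIC|SAVE|DATA)\b",
    re.IGNORECASE,
)


def is_comment(line):
    """Fixed-form comment (C or * in column 1) or a blank line."""
    return line[:1] in ("C", "c", "*") or not line.strip()


def is_continuation(line):
    """Fixed-form continuation: blank columns 1-5, a mark in column 6."""
    return len(line) > 5 and line[:5].strip() == "" and line[5] not in (" ", "0")


def insert_traces(lines):
    """Insert a trace WRITE before the first executable line of each SUBROUTINE."""
    out = []
    pending = None  # subroutine name still waiting for its trace WRITE
    for line in lines:
        skip = is_comment(line) or is_continuation(line) or SPEC_RE.match(line)
        if pending is not None and not skip:
            out.append(f"      WRITE(6,*) 'ENTER {pending}'")
            pending = None
        match = None if is_comment(line) else SUB_RE.match(line)
        if match:
            pending = match.group(1).upper()
        out.append(line)
    return out
```

Running the same filter over the serial and
the multi-task source gives two traces with
matching labels, which makes them easier to
compare.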

Dumps, the method of last resort, are 
frequently less consistent than traces or break- 
points. We avoided dumps at all cost. 

One technique tried in the latter stages
of multi-task conversion was program proving.
Toward completion of program scale-up, we
had a tricky change to a SUBROUTINE call.
Precondition and postcondition assertions were
compiled around critical code changes.
Proof techniques had limitations in a parallel
environment, but they were useful for
checking changes. Program proving is not a
cure-all, and it is regarded as controversial.
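The flavor of such assertions can be shown on
a hypothetical example; the actual TWING
change is not reproduced here. The Python
sketch below guards a routine that computes
one TASK's share of a DO-loop range, with the
function name and arguments chosen only for
illustration.

```python
def partition_bounds(n, n_tasks, task):
    """Return the inclusive, 1-based loop bounds (lo, hi) for one task."""
    # Preconditions: a valid task index and at least one iteration per task.
    assert n_tasks >= 1
    assert 1 <= task <= n_tasks
    assert n >= n_tasks

    chunk, extra = divmod(n, n_tasks)
    lo = (task - 1) * chunk + min(task - 1, extra) + 1
    hi = lo + chunk - 1 + (1 if task <= extra else 0)

    # Postcondition: the bounds are ordered and stay inside 1..n.
    assert 1 <= lo <= hi <= n
    return lo, hi
```

The assertions do not prove the partition
correct, but a violated precondition stops the
run at the call that caused it rather than at
some later, unrelated point.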

The last set of problems involves
synchronization and timing. A new diagnostic
message for first-time multi-tasking
programmers is compressed from a real CRAY job
in Figure 14. Race conditions occur whenever
two or more TASKs or processors are sharing
data (or code). This is the time when
deadlock can occur. There are no general
solutions, but there is a mountain of research
literature. Multi-tasking CFT provides
limited deadlock detection and traceback.
Keeping TASK scheduling and timing constraints
simple is currently the best way to avoid
deadlock. The most difficult deadlock
problems should occur when there are indirect
deadlocks.

Testing Multiprocessor Outputs 

A running multi-task program was not
enough; we sought numerical results identical
to our uni-task TWING. There were many
occasions where our program ran to
completion, but our numbers did not agree at
lesser digits of precision. A standard file
comparator was used to test output between
TWING runs. The importance of tools such as a
good file comparator should not be
underestimated. A single, incorrect, boundary
subscript could "poison" an entire array.
Testing asynchronous methods (e.g., chaotic
relaxation) is more difficult.

Fortunately, our program is completely 
synchronous. However, newer asynchronous, 
chaotic algorithms remove the consistency 
assumption and approximate a solution. If 
such asynchronous methods are used, file com- 
parator programs are completely inadequate. 
Better comparison tools are needed. Output
testing tools must support approximate
floating-point comparisons within a specified
tolerance.
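A tolerance-based comparator of the kind
called for here can be sketched in a few
lines. This Python version is an illustration
under assumptions: output is taken to be
whitespace-separated fields, the function name
and default tolerance are hypothetical, and
fields that Python cannot parse as numbers
(for example FORTRAN D-exponent values) fall
back to exact string comparison.

```python
import math


def compare_numeric(uni_text, multi_text, rel_tol=1e-9, abs_tol=0.0):
    """Return (line_no, field_no, a, b) for each field that disagrees."""
    uni_lines = uni_text.splitlines()
    multi_lines = multi_text.splitlines()
    if len(uni_lines) != len(multi_lines):
        raise ValueError("outputs have different numbers of lines")

    mismatches = []
    for line_no, (ua, mb) in enumerate(zip(uni_lines, multi_lines), start=1):
        fa, fb = ua.split(), mb.split()
        if len(fa) != len(fb):
            mismatches.append((line_no, None, ua, mb))
            continue
        for field_no, (a, b) in enumerate(zip(fa, fb), start=1):
            try:
                x, y = float(a), float(b)
            except ValueError:
                # Non-numeric field: require an exact match.
                if a != b:
                    mismatches.append((line_no, field_no, a, b))
                continue
            if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
                mismatches.append((line_no, field_no, a, b))
    return mismatches
```

With `rel_tol=1e-6`, the values 2.0 and
2.0000000001 compare as equal, while 2.0 and
2.1 are reported as a mismatch.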

The Cray multi-task version of TWING
proved our concept by reaching convergence
with results identical to a uni-task version
of vectorized TWING.

Other Generally Useful Tools 

While mentioning debugging tools, we 
should also mention other generally useful 
tools. Among these we could include tools to 
search for STATIC allocation and data depen- 
dence. Data dependence tools can also pro- 
vide help when recursion is added to FOR- 
TRAN. A good cross-referencing tool could 
aid this search process. Other tools could pos- 
sibly identify linkage problems. Such tools are 
useful in the analysis and compilation phases 
of development. All these programs should
execute independently (i.e., separately from
a compiler) in the style of other good
software tools.
USER UT024 - DEADLOCK - ALL USER TASKS WAITING FOR LOCKS OR EVENTS
USER TB001 - BEGINNING OF TRACEBACK
USER - $TRBK WAS CALLED BY UTERP% AT 1715731a
USER - UTERP% WAS CALLED BY $SUSTSK% AT 1705027a
USER - $SUSTSK% WAS CALLED BY EVWAIT AT 1701511b
USER TB002 - END OF TRACEBACK

Figure 14. A frequent error message for new users of multitasking. 


Performance and Execution Behavior 

The measurement of parallel programs is 
conceptually complicated by several factors. 
The Cray measurement facilities, if used, 
record the length of all parallel execution 
traces as if they were measured sequentially. 
For instance, two cycles run in parallel take 
one cycle to execute, but they are still counted 
as two cycles. The Cray documentation
(Research, 1985) notes that flow tracing
facilities do not work properly with
multi-tasking environments. We resort to the
direct use of the system real-time clock and
flow tracing of the uni-task version of TWING
to give us run-time characteristics.

There are no standard metrics for
determining multiprocessor performance
improvement. The most common in use is
[simple] speed-up, defined by:

$$\text{Simple speed-up} = \frac{\text{Serial execution time}}{\text{Parallel execution time}}$$

The simple speed up of TWING is illustrated 
in the next table. 

Another conceptual measurement problem
is where and how measurements are taken. We
simply throw two CPUs at a problem, so the
best possible parallel execution time is
one-half the total serial execution time (a
maximum simple speed-up of two). I/O wait
time is a significant portion of the program
that cannot multi-task. We recognize that we
don't use two CPUs for the entire time: we
have serial code, and we have wait-time for
TASKs to finish and synchronize. Also, we
need more cycles to cover overhead.

Since we were able to multi-task only
50% of the total serial execution, the best
improvement we could gain would be 25% of
total execution. We might term this
performance figure proportional, simple
speed-up. As we multi-task more code, this
figure should slowly increase.
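The 25% figure follows from simple
arithmetic, sketched here in Python. The
function names are illustrative, and the model
assumes the multi-tasked portion divides
perfectly across the CPUs with no overhead,
which the surrounding text notes is
optimistic.

```python
def parallel_time_fraction(parallel_fraction, n_cpus):
    """Best-case parallel run time as a fraction of the serial run time."""
    serial_part = 1.0 - parallel_fraction
    return serial_part + parallel_fraction / n_cpus


def simple_speedup(parallel_fraction, n_cpus):
    """Serial time divided by best-case parallel time."""
    return 1.0 / parallel_time_fraction(parallel_fraction, n_cpus)
```

With half the execution multi-tasked on two
CPUs, `parallel_time_fraction(0.5, 2)` is
0.75, a 25% reduction in total execution, and
`simple_speedup(0.5, 2)` is 4/3, well short of
the two-CPU maximum of 2.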

Still another problem is that with two or
more CPUs sharing common resources (memory
and I/O), collisions become inevitable.
Processors are forced to wait, and this
expends more overhead cycles. This contention
is visible when running a uni-task version
of the code in one processor and running a
second code in another processor. By varying
the work load in the second program between
a CPU-intensive and a memory-intensive
JOB, we can see the simple but significant
effects of memory contention (see Table 3).
These are interference effects not found on
uniprocessors. A problem arises in shared
memory multiprocessors, such as our VAX and
Cray, that local memory multiprocessors do
not have: memory contention significantly
slows down memory performance. Designers of
future multiprocessors must balance
processor- versus memory-performance rates.

Another performance issue is the
addition of overhead cycles required to
control TASKs. Figure 15 shows the cost in
cycles versus the iterations toward solution
for our VAX version. This cost occurs
similarly on the Cray.

Load balancing is a significant problem
since TASKs vary in work load, and we have
seen that measurement of load has problems.
The output from the Cray day files shows a
considerable imbalance of work. Cray tools
showed that TASK 1 did more work (executed
significantly longer) than TASK 2. The
timing output of a single day-file illustrates
the difference on our two-CPU system:
Table 3. Uni-task TWING Execution Times (in Seconds)

                     Low Memory Contention     High Memory Contention
                     STATIC       STACK        STATIC       STACK
                     compile      compile      compile      compile
Real [Wall] Time     [illegible]  7.36         8.49         8.04
Total-System Time    [illegible]  6.76         8.03         7.35
Subroutines:
  Input†             0.0244       0.0255       0.0264       0.0276
  Init†              0.250        0.210        0.270        0.225
  Solve              7.08         6.52         7.73         7.09
  RO                 1.20         1.06         1.35         1.13
  ROCO               0.995        0.893        1.07         0.970
  RESID              1.77         1.74         1.91         1.89

† These SUBROUTINES were not converted to use
multi-tasking. They are included here for
control reasons to show the effect of changing
to a STACK compilation.





TASK    CP TIME
1       11.85
2       4.83


This is because the work areas were not
partitioned evenly between the two TASKs
based on hand analysis of array proportions.
Work was partitioned based on existing,
somewhat lopsided DO-loop parameters in
three-dimensional arrays.

To change these parameters would
require more computation and potentially
further array-subscript changes. Additional
algorithm modifications are required for
boundary regions. Dynamic load balancing is
harder still.
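The size of the imbalance in the day-file
figures above can be put into numbers with a
short calculation. The two ratios below are
common ways of summarizing imbalance and are
assumptions of this sketch rather than metrics
defined by the Cray tools; the function names
are illustrative.

```python
def load_imbalance(cp_times):
    """Longest task time divided by the mean task time (1.0 is balanced)."""
    mean = sum(cp_times) / len(cp_times)
    return max(cp_times) / mean


def effective_parallelism(cp_times):
    """Total CP time divided by the longest task time."""
    return sum(cp_times) / max(cp_times)
```

For CP times of 11.85 and 4.83 seconds, both
ratios come out near 1.4: the busier TASK
carries about 1.4 times the average load, and
the two TASKs together do only about 1.4
TASKs' worth of overlapped work instead of 2.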
Discussion
Further Research 

This research has not covered other
forms of multiprocess partitioning. Pipelines
are a common proposal: easily constructed and
debugged, but difficult to tune or load
balance. (See Scale-up.) The program's author
(Thomas) is considering this approach, but it
requires extensive rewriting.

Micro-tasking is another Cray-proposed
multiprocessing construct (Booth, 1985).
Micro-tasking involves a simpler, more
restrictive set of control primitives. Another
important issue is the area of scale-up (see
next section).

Scale-up 

Certain aspects of scaling up programs 
are trivial. Increasing problem size is not typ- 
ically a problem: our VAX case was not a 
necessary prerequisite to move the program to 
the Cray. Adding more processors, however, 
is not trivial. The work on the TWING code 
began before there was any consideration of 
generalizing the program to use more than two 
processors. 

The current multi-task work on TWING
will not generalize to an n-processor case.
The code used to determine maxima is one
problem that will not easily scale. If more
than two processors are used, different
partitioning schemes become preferable.
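One partitioning scheme that does generalize
to n TASKs is a two-stage reduction: each
TASK finds the maximum of its own slice, and
the partial maxima are then combined. The
Python sketch below illustrates only that
general pattern; it is not how TWING
determines its maxima, and the function names
are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n, n_tasks):
    """Split range(n) into n_tasks contiguous half-open slices."""
    chunk, extra = divmod(n, n_tasks)
    bounds = []
    start = 0
    for t in range(n_tasks):
        size = chunk + (1 if t < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def parallel_max(values, n_tasks):
    """Find the maximum by combining one partial maximum per task."""
    if not values:
        raise ValueError("values must not be empty")
    n_tasks = max(1, min(n_tasks, len(values)))  # no empty slices
    bounds = chunk_bounds(len(values), n_tasks)
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        partial = list(pool.map(lambda b: max(values[b[0]:b[1]]), bounds))
    return max(partial)  # serial combine of n_tasks partial results
```

The serial combine step grows with the number
of TASKs, so for the hundreds of processors
mentioned in the Conclusions a tree-shaped
combine would likely be preferable.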

Probably the key issue of multi-tasking
is whether the performance gained was worth
the effort expended. There is a conflict (or
tradeoff) between the need to have large
multi-task sections for performance and small
multi-task sections for ease of development
and debugging.

The multi-tasking programmer must also 
confront the need to have large protected criti- 
cal sections and many asynchronous processes 
running. Our scale-up of the code uncovered 
many machine-dependent assumption prob- 
lems. For the scale-up of code, the parallel- 
serial divide-and-conquer approach again 
worked. 

Open Issues 

The problems of automatic partitioning
are not addressed in this study. Our future
intent is to extend FORTRAN by using a
simple preprocessor to add support for simpler
constructs (e.g., COBEGIN, COEND) like
Cray micro-tasking. The preprocessor should
ideally hide low-level details and
machine-dependent processing. It is tempting
for programmers to be parochial about
particular constructs, so we wish to avoid
this by using preprocessors. Similar research
is under study on different architectures at
other sites (e.g., LANL, ANL, Bell Labs, CMU,
U. of Ill.).

There are dozens of issues left open:
different synchronous and asynchronous
algorithms, translation into an intermediate
language for dataflow-style execution,
measurement, and load balancing. Parallel
processing has many difficult problems
remaining which will take years to research.

Conclusions 

The introduction of parallelism is as sig- 
nificant a tool as either Cray multi-tasking or 
micro-tasking. The problems of parallelism 
are not new. They are typically thought to 
inhabit that realm called systems program- 
ming. Users intending to add parallelism to
their collection of tools are advised to learn
from the experience of others.

Good software tools would help program- 
mers. These tools must provide multiprocess- 
ing support. Many programmers would prob- 
ably desire a standardization of multiprocess- 
ing syntax, but this is premature. 

Programmers should recognize that, with 
adding parallelism and achieving better perfor- 
mance, there will come some loss of the 
coherent sequence that makes sequential pro- 
gramming such a powerful tool. 

Programs designed to use parallelism 
from their inception are more likely to use 
parallelism efficiently. This was clearly the 
case with the introduction of vectorization, 
i.e., vector-designed programs tend to use vec- 
tors more efficiently. We should soon see 
more multi-task programs, but it is an open 
question whether these programs are scale-able 
into the hundred- and thousand- (proposed) 
processors range. 